Overview of the data set

## 'data.frame':    4898 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ quality_factor      : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol         quality      quality_factor
##  Min.   : 8.00   Min.   :3.000   3:  20        
##  1st Qu.: 9.50   1st Qu.:5.000   4: 163        
##  Median :10.40   Median :6.000   5:1457        
##  Mean   :10.51   Mean   :5.878   6:2198        
##  3rd Qu.:11.40   3rd Qu.:6.000   7: 880        
##  Max.   :14.20   Max.   :9.000   8: 175        
##                                  9:   5

The data set contains 4898 observations of 14 variables. As the quality score lies between 0 and 10, it makes sense to treat it as a factor. To do so, I added another variable, quality_factor. Some of the variables seem to have extreme outliers, which we should take into account when creating plots for them.
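The conversion to a factor can be sketched as follows. This is a minimal example on a small, made-up data frame (the name `wine` and its values are assumptions for illustration, not the real data):

```r
# Hypothetical stand-in for the wine data; only the quality column matters here.
wine <- data.frame(quality = c(6L, 6L, 5L, 7L, 3L, 9L, 8L, 4L, 5L, 6L))

# Keep the integer score and add a copy of it as a factor.
wine$quality_factor <- factor(wine$quality)

# Inspect the structure and the number of wines per level.
str(wine)
table(wine$quality_factor)
```

Keeping both columns means the integer version stays available for correlations, while the factor version can be used for grouping in bar plots and box plots.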

Univariate Plots Section

To get an overview of the 13 variables, creating a grid with distribution histograms seems to be the best way to start.

The output looks good, so I am now creating plots for each variable. To ignore the outliers, I will set the axis limit to the 99% quantile where necessary.
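One possible way to derive such a limit is shown below. This is only a sketch with base R graphics and simulated values (the vector `x` is an assumption standing in for a long-tailed variable); the plots in this report may have been produced differently.

```r
set.seed(1)
# Simulated right-skewed values standing in for a long-tailed variable.
x <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Upper limit at the 99% quantile.
upper <- quantile(x, probs = 0.99)

# x itself is left unchanged; only the plotted subset is restricted.
hist(x[x <= upper], breaks = 40,
     main = "Values up to the 99% quantile", xlab = "x")
```

With this limit, roughly the largest 1% of the values are left out of the histogram, so the bulk of the distribution is not squeezed into a few bins.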

Acidity values (g / dm^3)

These three plots show the distribution of the three types of acid values in the data set. The maximum amount for citric.acid is 1.66 g / dm^3, which is more than 5 times the median (0.32 g / dm^3). For that reason, I will set the limit of the x-axis to 0.85. I will add another plot showing the log transformed values for citric.acid:

Residual Sugar (g / dm^3)

The distribution of sugar looks skewed in the grid above, so I will choose a different bin width and set a limit. The summary of the data shows that the minimum is 0.60 and the maximum is 65.80 g / dm^3. However, 75% of the values are at or below 9.90, which is far below the maximum.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

For this variable, we can also use a log10 scale to get a better view of the distribution:

Interestingly enough, in this histogram we can see that there are two peaks, around 1.5 and 10.

Chlorides (sodium chloride - g / dm^3)

This distribution has a long tail to the right, so I am setting a limit on the x-axis. The distribution is skewed right: the mean (0.04577) lies above the median (0.04300).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
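The direction of the tail can also be read off the summary values. A rough check, using only the numbers printed above (the variable names are made up for this sketch, and the comparison is a heuristic, not a formal skewness measure):

```r
# Values copied from the summary() output for chlorides.
s <- c(min = 0.00900, q1 = 0.03600, median = 0.04300,
       mean = 0.04577, q3 = 0.05000, max = 0.34600)

# Distance from the median to each extreme: the longer side is the tail.
left_reach  <- s[["median"]] - s[["min"]]
right_reach <- s[["max"]] - s[["median"]]
c(left = left_reach, right = right_reach)

# A mean above the median points in the same direction.
s[["mean"]] > s[["median"]]
```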

Free and total sulfur dioxide (mg / dm^3)

The distributions of both free sulfur dioxide and total sulfur dioxide have a long tail to the right, so I am discarding the outliers by setting a limit again.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

With a log10 transformation, the distribution seems to be approximately normal:
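The effect of the transformation on a long-tailed variable can be illustrated with simulated data. The sketch below uses made-up log-normal values (not the real sulfur dioxide measurements) and compares mean and median before and after taking log10:

```r
set.seed(3)
# Simulated long-tailed values; not the real measurements.
x <- rlnorm(2000, meanlog = log(34), sdlog = 0.5)

summary(x)          # raw scale
summary(log10(x))   # log10 scale

# Gap between mean and median as a rough indicator of asymmetry.
gap_raw <- mean(x) - median(x)
gap_log <- mean(log10(x)) - median(log10(x))
c(raw = gap_raw, log10 = gap_log)
```

On the raw scale the mean sits clearly above the median, while on the log10 scale the two nearly coincide, which is what a roughly symmetric distribution looks like in a summary.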

Density (g / cm^3)

The values for density are between 0.987 and 1.039, so they span a very narrow range. I am setting the binwidth to 0.0005. The distribution looks close to a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The distribution looks scattered, so I will try to get more information with a log10 transformation:

For this variable the log10 transformation does not provide us with more information. I assume the reason is that the values span a very narrow range (0.987 to 1.039), over which log10 is almost linear, so the shape of the histogram barely changes.
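This assumption can be checked numerically. The sketch below uses an evenly spaced grid over the observed ranges (the grids are an illustration, not the data) and measures how close log10 is to a straight line on each range:

```r
# Grid over the density range taken from the summary above.
d <- seq(0.9871, 1.0390, length.out = 200)

# Correlation between the values and their log10: 1 would mean
# the transformation is exactly linear on this range.
r <- cor(d, log10(d))
r

# Contrast: residual.sugar spans 0.6 to 65.8, where log10 bends strongly.
s <- seq(0.6, 65.8, length.out = 200)
cor(s, log10(s))
```

A value very close to 1 for the density range means the transformation only rescales the axis there, whereas for a variable spanning two orders of magnitude it changes the shape noticeably.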

pH

The pH value is approximately normally distributed. No limits had to be set, so no outliers had to be removed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Sulphates (potassium sulphate - g / dm^3)

The distribution of sulphates is slightly right-skewed: most values lie on the left side and a tail extends to the right. I will set a limit on the x-axis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Alcohol (% by volume)

The distribution of alcohol is slightly right-skewed, with most values on the left side, and widely spread. There are no outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Quality (score between 0 and 10)

We are taking a look at the quality. As this variable was transformed to a factor, we use a barplot.

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Univariate Analysis

What is the structure of your dataset?

The dataset is related to Portuguese wine. It has 4898 observations with 12 attributes.

Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid - g / dm^3)

  2. volatile acidity (acetic acid - g / dm^3)

  3. citric acid (g / dm^3)

  4. residual sugar (g / dm^3)

  5. chlorides (sodium chloride - g / dm^3)

  6. free sulfur dioxide (mg / dm^3)

  7. total sulfur dioxide (mg / dm^3)

  8. density (g / cm^3)

  9. pH

  10. sulphates (potassium sulphate - g / dm^3)

  11. alcohol (% by volume)

Output variable (based on sensory data):

  1. quality (score between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

In the Univariate Plots section, I have created plots for each variable, but I haven’t found anything very surprising. Most of the distributions look roughly normal, some of them with a long tail to the right. I cannot identify any correlations yet, but this will be done in the next section.

Based on the description given, the main interest is the quality of wine. The quality was graded by experts, so it will be interesting to find out what actually influenced their grading or rather if there is a variable that influences the grade in either a positive or a negative direction.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

With 11 input variables based on physicochemical tests and one output variable based on sensory data, there is enough to investigate.

Did you create any new variables from existing variables in the dataset?

I created the variable quality_factor to represent the quality in a factor format. This makes it possible to create barplots for this variable and also makes it easier to use it in the plots following in the next sections.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There was no need to tidy the dataset, as there were no missing values.

I set several limits on the x-axis of the plots I created to avoid showing single outlier values. I do not know whether they were caused by measuring errors or if some wines have these extremely high/low values.

Something else I noticed is the distribution of the quality levels. There are no wines with a grade less than 3 and no wines with a score greater than 9. Most of them were graded as 6.

Bivariate Plots Section

The aim of this section is to find out about what influences the quality. I will start with a correlation matrix in order to check all variables.

The correlation plot above shows that quality does not correlate with many of the other variables. There is a moderate positive correlation between quality and alcohol (0.43) and a moderate negative correlation between quality and density (-0.31).

The strongest correlations are between density and residual.sugar (0.84), and between density and alcohol (-0.78).
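A correlation matrix of this kind can be computed with `cor()`. The sketch below uses simulated columns (the data frame `toy` and the coefficients used to generate it are assumptions for illustration only, so the resulting numbers differ from the ones reported here):

```r
set.seed(42)
n <- 500
# Simulated stand-ins; the real values come from the wine data.
alcohol        <- rnorm(n, mean = 10.5, sd = 1.2)
residual.sugar <- rexp(n, rate = 1 / 6)
density <- 0.994 - 0.0012 * (alcohol - 10.5) +
  0.0004 * (residual.sugar - 6) + rnorm(n, sd = 0.0005)
toy <- data.frame(alcohol, residual.sugar, density)

# Pearson correlation matrix, rounded for readability.
cm <- round(cor(toy), 2)
cm
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be read.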

Alcohol and Quality

The first plot shows that there is a moderate positive correlation between quality and alcohol (0.43). However, the boxplot gives us information that we couldn’t see before: from quality factor 3 to 4 and from 4 to 5, the median alcohol level drops. This runs against the correlation we found before, so we have to find out what is going on here.

##        
##           3   4   5   6   7   8   9
##   0.987   0   0   2   0   5   1   0
##   0.988   0   0   0   9   8   0   0
##   0.989   0   2   4  70  66  24   0
##   0.99    0   8  31 170 132  28   3
##   0.991   2  16  40 241 185  31   1
##   0.992   3  13 154 284 156  26   0
##   0.993   2  25 166 290  88  21   0
##   0.994   3  27 203 235  67  15   0
##   0.995   2  25 166 247  49   9   0
##   0.996   1  16 219 190  23   3   0
##   0.997   3   8 144 160  17   0   1
##   0.998   2  17 169 155  42   8   0
##   0.999   0   4  80  95  22   6   0
##   1       2   2  65  37  20   1   0
##   1.001   0   0  11   7   0   2   0
##   1.002   0   0   3   3   0   0   0
##   1.003   0   0   0   2   0   0   0
##   1.01    0   0   0   2   0   0   0
##   1.039   0   0   0   1   0   0   0

This table gives us an explanation for this finding: there are relatively few wines graded with 3 and 4. For this reason, these values have less influence on the correlation coefficient.
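How small these groups are can be quantified from the quality counts reported earlier in this report. The following sketch only reuses those counts (the variable names are made up for illustration):

```r
# Counts per quality level, copied from the quality table in this report.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Share of each level among all wines.
share <- counts / sum(counts)
round(share, 3)

# Combined share of levels 3 and 4.
low_share <- sum(share[c("3", "4")])
low_share
```

Levels 3 and 4 together account for 183 of the 4898 wines, which is less than 4% of the data.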

Quality and Density

Both plots show the negative correlation between quality and density. Similar to the previous plots, this is not very obvious for the lower quality levels. The level 5 wines even have the highest median for density.

Density and alcohol

The plot shows a trend that wines with a low density have more alcohol.

Density and Residual sugar

The plot shows a trend that wines with a high density have more residual sugar.

Alcohol by quality (cumulated)

This plot represents the two variables with the highest correlation. I have coloured the points by the rounded value of alcohol, converted to a factor. You can clearly see how the colours change across the plot.

This plot shows the quantities of each quality factor for each level of alcohol. You can see that the blue part is growing from the left to the right. Only the leftmost bar has a distinct red part, which represents the lowest level.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

As I decided to focus on the quality variable I paid special attention to the correlation between quality and other variables. There is a moderate positive correlation with alcohol (0.43) and a moderate negative correlation with density (-0.31).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

As the correlation matrix indicates, most of the variables do not have a strong correlation with each other. For that reason, I did not take a closer look at them.

What was the strongest relationship you found?

The strongest relationships I found are between density and residual.sugar (0.84), and between density and alcohol (-0.78).

Multivariate Plots Section

Density and residual sugar

Density and alcohol

This is a plot for the two variables that correlate the most with alcohol. We can see that the points get a lighter colour the higher the density is and the lower the alcohol value.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The plots confirm the correlation I had found before.

Looking at the colours of both plots you can see the different correlations: while the first plot represents only moderate correlations, the second has a clearer colouring. Furthermore, it illustrates the difference between a positive and a negative correlation. While the points in the first plot (for density and residual sugar) are aligned on an axis from the lower left to the upper right corner, the second plot (for density and alcohol) is from the upper left to the lower right corner.

Were there any interesting or surprising interactions between features?

I think it is quite interesting to factorize values in order to use them as a category for plots. While this might not be a good solution for some values, it seems to be a suitable solution for the amount of alcohol.


Final Plots and Summary

Plot One

Description One

This scatterplot shows the variable residual.sugar on the x-axis and the variable density on the y-axis. The points are coloured by the rounded amount of alcohol. It’s obvious that the dominating colour is changing from purple in the lower left of the plot (wines with low density and low residual sugar) to blue in the upper right. This means that the amount of alcohol is decreasing.

Plot Two

Description Two

For me, this is the most surprising plot in this report. While there is a moderate correlation between quality and alcohol (0.43), it’s not possible to identify this correlation in the box plot. One could intuitively think that the boxes should be higher with each quality level as these variables correlate positively. This is not true for levels 3, 4 and 5. On the contrary, the boxes even get lower across these three levels.

As I have explained in the section above, there are relatively few wines graded with 3 and 4, so this plot is surprising, but it can be explained. However, it shows that it is often not enough to present data with only one plot. Showing only this boxplot could have led to the wrong impression that wine with a low alcohol content has a medium quality level.

Plot Three

Description Three

This is a more experimental plot I have created to show the composition of quality ratings for the different amounts of alcohol. It shows the amount of alcohol (rounded) as a factor on the x-axis and the quality levels within each factor level as a stacked bar. To do so, I have rounded the alcohol variable, so that we have 7 levels now. This table shows the alcohol amount in the first row and the number of wines with this amount in the second.

## 
##    8    9   10   11   12   13   14 
##   14 1194 1527 1034  774  314   41
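The rounding step and the composition per alcohol level can be sketched as below. The data frame `toy` holds simulated values (an assumption for illustration, not the wine data), so its counts do not match the table above:

```r
set.seed(7)
# Simulated stand-in for the alcohol and quality columns.
toy <- data.frame(
  alcohol = round(runif(300, min = 8, max = 14.2), 1),
  quality = sample(3:9, 300, replace = TRUE)
)

# Round alcohol to whole percent and treat the result as a category.
toy$alcohol_factor <- factor(round(toy$alcohol))

# Counts per alcohol level, then the quality composition within each level.
table(toy$alcohol_factor)
comp <- prop.table(table(toy$alcohol_factor, toy$quality), margin = 1)
round(comp, 2)
```

Each row of `comp` sums to 1, which corresponds to one stacked bar when the bars are scaled to the same height.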

Similar to the distribution of the quality levels, there are also relatively few wines with a low and a high amount of alcohol. However, I think by showing them in this way, the plot gives a good overview.


Reflection

In the analysis of the data set I found two variables that correlate moderately with the quality of wine (alcohol and density) and took a closer look at the variables with the highest correlations overall. The quality is an interesting value to look at because it is the only subjective one. So we can find out which measured values influence how experts evaluate the quality.

The data set was clean and the documentation was really good, so I did not run into any major problems. I have worked a lot with R in the past and have generated many plots for my own research. I would not consider myself a professional, but I know what is possible and where I can find answers to my questions. However, I was not really happy that there were no high correlations with quality in the data set, which would have made the analysis a bit smoother and would have given me more options to create interesting plots.

For future work, it would be interesting to create a model to predict the quality of wine from the measured values. It would be really interesting to find out which values a “perfect wine” would have. If we conducted this research for a wine-selling company, we could even try to create such a wine and let it be graded by the experts. I am not sure if it is possible to create a reliable model, as some of the values have a very low correlation with quality, so they cannot be used for a prediction. Furthermore, as already mentioned above, the quality scores do not give a lot of information as only 7 grades were assigned in this data set.